About the project

The project seems interesting, and I enrolled in the course to learn more about R and open data science. This site will be updated as the tasks are completed, and a new ABOUT introduction will appear with more details about the course contents and more general questions about open data and open science.

Analysis of Learning Data

The learning data were collected in 2014 in the International Survey of Approaches to Learning. The dataset used for these analyses is a modified version containing only the gender and age of the students, together with new variables for attitude toward statistics, exam points, and scores for the deep, strategic and surface learning approaches.

Raw analyses

First we plotted a correlation matrix with ggpairs to find correlations between variables and to explore the distributions of the variables visually. We observe that age has a right-skewed distribution (a long right tail), meaning that the participants are mainly young, as expected from the cohort. The other variables follow more or less a normal distribution, apart from attitude toward statistics in men, which is slightly right-tailed. We found a positive correlation between attitude and exam points. There is a negative correlation between the deep learning approach and the surface learning approach. See plot below:

We obtained descriptive statistics grouped by gender and observed that mean age and mean attitude are higher in males than in females. The other variables have similar means in the two groups; see the table below. In simple t-tests, only the difference in mean attitude between genders was significant (p<0.001).

variable       gender   mean    sd  median   IQR
gender (code)  F         1.0   0.0     1.0   0.0
gender (code)  M         2.0   0.0     2.0   0.0
age            F        24.9   7.4    22.0   6.0
age            M        26.8   8.4    24.0   8.0
attitude       F         3.0   0.7     3.0   1.1
attitude       M         3.4   0.6     3.4   0.8
deep           F         3.7   0.5     3.7   0.8
deep           M         3.7   0.6     3.8   0.7
stra           F         3.2   0.7     3.2   1.1
stra           M         3.0   0.8     3.0   1.2
surf           F         2.8   0.5     2.8   0.6
surf           M         2.7   0.6     2.6   0.9
points         F        22.3   5.8    23.0   7.0
points         M        23.5   6.0    23.5   8.2

Multiple linear model

We fitted a linear model with exam points as the outcome and attitude, strategic learning score (stra) and age as explanatory variables. The summary of the model is presented below. Participants with higher attitude scores had higher exam points (p<0.0001). There was a trend towards higher exam points with higher strategic learning scores, and towards lower exam points with increasing age (both p<0.1). Controlling for gender did not change the results of the model substantially. The model has an adjusted R-squared of 0.2037, meaning that it explains about 20% of the variance in exam points, which is quite reasonable for survey data. The fitted model can be written as: Exam points = 10.9 + 3.5 × attitude + 1.0 × strategy - 0.1 × age

Dependent variable: points

             Estimate    (Std. Error)
attitude      3.481***   (0.562)
stra          1.004*     (0.534)
age          -0.088*     (0.053)
Constant     10.895***   (2.648)

Observations          166
R2                    0.218
Adjusted R2           0.204
Residual Std. Error   5.260 (df = 162)
F Statistic           15.069*** (df = 3; 162)
Note: *p<0.1; **p<0.05; ***p<0.01
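To make the fitted equation concrete, the following minimal sketch plugs the coefficients from the regression table above into a small helper. The function name predict_points and the two example students are hypothetical and only serve as illustration; they are not part of the original analysis.

```r
# Coefficients copied from the regression table above.
coefs <- c(intercept = 10.895, attitude = 3.481, stra = 1.004, age = -0.088)

# Hypothetical helper: predicted exam points for given predictor values.
predict_points <- function(attitude, stra, age) {
  unname(coefs["intercept"] +
           coefs["attitude"] * attitude +
           coefs["stra"] * stra +
           coefs["age"] * age)
}

# Two made-up students who differ only by one point in attitude.
p_low  <- predict_points(attitude = 3, stra = 3, age = 25)
p_high <- predict_points(attitude = 4, stra = 3, age = 25)

print(c(low = p_low, high = p_high, difference = p_high - p_low))
```

With everything else held fixed, the difference between the two predictions equals the attitude coefficient, which is how the estimate of 3.481 should be read.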

In the diagnostic plots of our model, the residuals vs. fitted plot shows a random pattern, suggesting no systematic bias; the Q-Q plot is almost linear, which indicates approximately normal residuals; and the residuals vs. leverage plot shows no outliers which could distort the model.


Chapter 4

Use of Boston data from MASS library

Boston is a data set with 506 observations and 14 variables on housing values in the suburbs of Boston. A correlation plot matrix and summaries of the variables can be found below.

We observe moderate to strong correlations (roughly 0.6 or more in absolute value) between the variables rad and crim, tax and crim, age and zn, dis and zn, nox and indus, age and indus, dis and indus, rad and indus, tax and indus, lstat and indus, age and nox, dis and nox, rad and nox, tax and nox, lstat and nox, lstat and rm, medv and rm, dis and age, lstat and age, tax and rad, lstat and medv. Apart from rm, which appears to follow a normal distribution, none of the variables is normally distributed.

Descriptive statistics (summary) of the variables in Boston are given in the following table.

variable    mean     sd  median   se    IQR
crim         3.6    8.6     0.3  0.4    3.6
zn          11.4   23.3     0.0  1.0   12.5
indus       11.1    6.9     9.7  0.3   12.9
chas         0.1    0.3     0.0  0.0    0.0
nox          0.6    0.1     0.5  0.0    0.2
rm           6.3    0.7     6.2  0.0    0.7
age         68.6   28.1    77.5  1.3   49.0
dis          3.8    2.1     3.2  0.1    3.1
rad          9.5    8.7     5.0  0.4   20.0
tax        408.2  168.5   330.0  7.5  387.0
ptratio     18.5    2.2    19.1  0.1    2.8
black      356.7   91.3   391.4  4.1   20.8
lstat       12.7    7.1    11.4  0.3   10.0
medv        22.5    9.2    21.2  0.4    8.0

Because the variables are measured on very different scales, we standardised the data.

Descriptive statistics (summary) of the standardised variables in Boston are given in the following table. Note that after standardisation every variable has a mean of 0 and a standard deviation of 1.

variable  mean  sd  median  se  IQR
crim         0   1    -0.4   0  0.4
zn           0   1    -0.5   0  0.5
indus        0   1    -0.2   0  1.9
chas         0   1    -0.3   0  0.0
nox          0   1    -0.1   0  1.5
rm           0   1    -0.1   0  1.1
age          0   1     0.3   0  1.7
dis          0   1    -0.3   0  1.5
rad          0   1    -0.5   0  2.3
tax          0   1    -0.5   0  2.3
ptratio      0   1     0.3   0  1.3
black        0   1     0.4   0  0.2
lstat        0   1    -0.2   0  1.4
medv         0   1    -0.1   0  0.9
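The standardisation can be reproduced with base R's scale(), which subtracts each column's mean and divides by its standard deviation. This is a minimal sketch of that step; the object name boston_scaled is chosen here for illustration.

```r
library(MASS)  # provides the Boston data set

# scale() centres each column on its mean and divides by its standard deviation.
boston_scaled <- as.data.frame(scale(Boston))

# Every column should now have a mean of (numerically) 0 and a standard deviation of 1.
print(round(colMeans(boston_scaled), 10))
print(sapply(boston_scaled, sd))
```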

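We cut the standardised crime rate (crim) into four classes at its quartiles and used the resulting categorical variable, crime, as the target. We divided the standardised Boston data set into a training set and a test set, with 80% of the data assigned to the training set, and fitted a linear discriminant analysis (LDA) to the training set.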

We plotted this LDA model in the following biplot.

                  [-0.419,-0.411]  (-0.411,-0.39]  (-0.39,0.00739]  (0.00739,9.92]
[-0.419,-0.411]                18              10                3               0
(-0.411,-0.39]                  5              15                4               0
(-0.39,0.00739]                 0              11               16               1
(0.00739,9.92]                  0               0                0              19

As the biplot shows, the predictor rad (index of accessibility to radial highways) points towards the high crime per capita class (blue). On the other hand, a high proportion of residential land zoned for lots over 25,000 sq.ft. (variable zn) is associated with low crime per capita. The two middle classes, depicted in red and green, overlap and are associated with multiple predictors.

We predicted the classes of the test set with the LDA model and observed that it predicts the highest crime class (crime rates above the mean) very well, but fails to distinguish the lower crime classes effectively.
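The whole LDA workflow described above can be sketched as follows. This is an illustration under assumptions rather than the exact code behind the results: the seed, the class labels ("low" to "high") and the object names are chosen here, and because the split is random the resulting counts will differ from the table shown earlier.

```r
library(MASS)

set.seed(123)  # arbitrary seed, only to make this sketch reproducible

boston_scaled <- as.data.frame(scale(Boston))

# Cut the standardised crime rate into four classes at its quartiles.
breaks <- quantile(boston_scaled$crim)
boston_scaled$crime <- cut(boston_scaled$crim, breaks = breaks,
                           include.lowest = TRUE,
                           labels = c("low", "med_low", "med_high", "high"))
boston_scaled$crim <- NULL  # drop the original continuous variable

# 80% of the rows go to the training set, the rest to the test set.
n <- nrow(boston_scaled)
train_idx <- sample(n, size = floor(0.8 * n))
train <- boston_scaled[train_idx, ]
test  <- boston_scaled[-train_idx, ]

# Fit the LDA on the training data and predict the held-out classes.
fit  <- lda(crime ~ ., data = train)
pred <- predict(fit, newdata = test)

# Cross-tabulate observed against predicted classes.
conf_mat <- table(observed = test$crime, predicted = pred$class)
print(conf_mat)
```

Counts on the diagonal of the cross-tabulation are correct predictions; off-diagonal counts show which neighbouring classes the model confuses.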

We calculated the distances between the observations in the standardised Boston data set and performed a k-means clustering analysis. We obtained two clusters and see that, for example, rad separates the two clusters well.
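A minimal sketch of the distance and clustering step, assuming Euclidean distances and two centres as described above. The seed and the nstart value are choices made for this example, not settings taken from the original analysis.

```r
library(MASS)

set.seed(123)  # arbitrary seed; k-means starts from random centres

boston_scaled <- scale(Boston)

# Euclidean distances between all pairs of observations.
dist_eu <- dist(boston_scaled)
print(summary(as.vector(dist_eu)))

# k-means with two centres; nstart reruns the algorithm from several random starts.
km <- kmeans(boston_scaled, centers = 2, nstart = 20)

print(table(km$cluster))
# Mean standardised rad in each cluster, to see how strongly it separates them.
print(tapply(boston_scaled[, "rad"], km$cluster, mean))
```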


ANALYSES

Implement the analyses of Chapter 8 of MABS using the RATS data.

The dataset used for this exercise is nutrition data acquired from three different groups of rats (Crowder and Hand, 1990). Each group was given a different nutritional regime, and the weight of the rats was measured repeatedly, at weekly intervals over a period of 9 weeks. First we shall plot the individual responses, i.e., the weight measurements and their changes over time for the three diet groups.

We can see that there seems to be a difference between the groups, with Group 1 having lower Weight than Groups 2 and 3, but we have to find out whether these differences are significant. Weight also seems to increase with time. Next we standardise the Weight values, because we want to see whether rats that already have a high weight at the start keep a high weight throughout the study (tracking).
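The standardisation can be sketched as below, assuming it is done within each measurement occasion (subtract the mean weight at that time point and divide by its standard deviation), which is the usual choice for tracking plots. The RATS file itself is not bundled with R, so this sketch uses the BodyWeight data from the nlme package as a stand-in; it records rat body weights under three diets, but its layout and column names may differ from the data used here.

```r
library(nlme)  # ships with R; BodyWeight is used as a stand-in for RATS

rats <- as.data.frame(BodyWeight)  # columns: weight, Time, Rat, Diet

# Standardise within each time point:
# (weight - mean weight at that time) / sd of weight at that time.
rats$stdweight <- ave(rats$weight, rats$Time,
                      FUN = function(w) (w - mean(w)) / sd(w))

print(head(rats))
```

After this step every time point has mean 0 and standard deviation 1, so a common upward trend no longer shows and only the relative position of each rat remains.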

Now we will plot the previous plot again, but using the standardised weight.

Note that the weight in groups 2 and 3 no longer seems to increase with time, as it did in the unstandardised plot.

Perhaps a summary graphical overview can give us a better insight into the changes of weight in time between groups.

We now see more clearly that there are differences between the groups of rats. How can we interpret the changes better? We can apply a summary measure, in this case the overall mean, which is appropriate because the time intervals are equal. To visualise the means, we can use boxplots.

We can see differences in the means over the observed period. We observe some outliers, which we could remove later. Since we have three groups, a Student's t-test is not suitable for comparing the means, but we can use an ANOVA for this purpose.

Analysis of Variance Model
            Df   Sum Sq   Mean Sq   F value    Pr(>F)
Group        2   236732    118366        88   2.8e-08
Residuals   13    17471      1344

We observe a statistically significant difference in mean Weight between the groups.
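The summary-measure approach followed by a one-way ANOVA can be sketched as follows. As the RATS file is not bundled with R, the sketch again uses nlme's BodyWeight data as a stand-in, so the numbers it prints are not expected to match the table above exactly; the object names are chosen for illustration.

```r
library(nlme)  # ships with R; BodyWeight is used as a stand-in for RATS

rats <- as.data.frame(BodyWeight)  # columns: weight, Time, Rat, Diet

# Summary measure: mean weight of each rat over the whole follow-up.
rat_means <- aggregate(weight ~ Rat + Diet, data = rats, FUN = mean)

# One-way ANOVA of the per-rat means on diet group.
fit <- aov(weight ~ Diet, data = rat_means)
print(summary(fit))
```

Reducing each rat to a single number removes the dependence between repeated measurements, which is why an ordinary ANOVA is acceptable at this stage.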

Implement the analyses of Chapter 9 of MABS using the BPRS data.

The BPRS data are assessments on the Brief Psychiatric Rating Scale (BPRS) of 40 male participants who were randomly assigned to two treatment groups and assessed at weekly intervals for eight weeks. The rating scale is used to evaluate patients suspected of having schizophrenia.

First we plot the data to analyse it visually.

We observe no marked differences between the treatments (shown in different colours). As a first step, we fit a simple linear regression model of the BPRS score on time in weeks and treatment group.

Fitting linear model: bprs ~ week + treatment
              Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)      46         1.4         34      6.3e-114
week             -2.3       0.25        -9      1.4e-17
treatment2        0.57      1.3          0.44   0.66

We observe that time in weeks has a significant effect on the BPRS score: for each week of treatment the score drops by about 2.3 points. There is no significant difference between the treatments.

This model does not take into account that the repeated measurements of the same subject over the weeks are correlated with each other. A better estimate can be obtained with a mixed model with a random intercept. In this model each subject gets its own intercept, so the subjects are treated as a random effect, while time and treatment group are fixed effects.
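A random intercept model of this kind can be sketched as below. Two things are assumptions here: the data are simulated with arbitrary parameters purely to make the example runnable (they are not the BPRS data), and the model is fitted with lme() from the nlme package, which ships with R; the analysis reported in this section may have used a different package.

```r
library(nlme)

set.seed(1)  # arbitrary seed for the simulated data

# Simulated stand-in for BPRS: 40 subjects, weeks 0 to 8, two treatments.
n_subj <- 40
d <- expand.grid(week = 0:8, subject = factor(seq_len(n_subj)))
d$treatment <- factor(ifelse(as.integer(d$subject) <= 20, 1, 2))
subj_effect <- rnorm(n_subj, sd = 7)  # subject-specific shifts of the intercept
d$bprs <- 46 - 2.3 * d$week + subj_effect[as.integer(d$subject)] +
  rnorm(nrow(d), sd = 10)

# Random intercept per subject; week and treatment are fixed effects.
fit_ri <- lme(bprs ~ week + treatment, random = ~ 1 | subject,
              data = d, method = "ML")

print(fixef(fit_ri))    # fixed-effect estimates
print(VarCorr(fit_ri))  # variance of the random intercepts and of the residuals
```

Changing the random part to random = ~ week | subject would additionally give each subject its own slope over time.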

Mixed model with random intercepts

Dependent variable: bprs

               Estimate    (Std. Error)
week            -2.270***  (0.208)
treatment2       0.572     (1.076)
Constant        46.454***  (1.909)

Observations          360
Log Likelihood        -1,369.356
Akaike Inf. Crit.     2,748.712
Bayesian Inf. Crit.   2,768.143
Note: *p<0.1; **p<0.05; ***p<0.01

On close inspection, the linear model and the mixed model give similar estimates for the fixed effects, although their standard errors differ. Not shown here is the variance of the subject-level random intercepts, 47.41 (standard deviation 6.885).

We can add random slopes to the model and evaluate if we get a better fit.

Mixed model with random slopes and intercepts

Dependent variable: bprs

               Estimate    (Std. Error)
week            -2.270***  (0.298)
treatment2       0.572     (1.040)
Constant        46.454***  (2.105)

Observations          360
Log Likelihood        -1,365.720
Akaike Inf. Crit.     2,745.440
Bayesian Inf. Crit.   2,772.643
Note: *p<0.1; **p<0.05; ***p<0.01

Compare the mixed models

Model                          Df        AIC        BIC      logLik   deviance  Chisq  Chi Df  Pr(>Chisq)
Random intercepts               5  2,748.712  2,768.143  -1,369.356  2,738.712
Random slopes and intercepts    7  2,745.440  2,772.643  -1,365.720  2,731.440  7.272       2       0.026

The mixed model with random slopes and random intercepts fits the data better (chi-squared = 7.272 on 2 degrees of freedom, p = 0.026).
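The comparison is a likelihood ratio test, and its numbers can be checked by hand from the log-likelihoods reported for the two models above. A minimal sketch, with variable names chosen for illustration:

```r
# Log-likelihoods and parameter counts as reported for the two models above.
loglik_intercept <- -1369.356  # random intercept model, 5 parameters
loglik_slope     <- -1365.720  # random intercept and slope model, 7 parameters

# Test statistic: twice the gain in log-likelihood.
chisq <- 2 * (loglik_slope - loglik_intercept)
df    <- 7 - 5  # number of extra parameters in the larger model
p     <- pchisq(chisq, df = df, lower.tail = FALSE)

print(round(c(chisq = chisq, df = df, p = p), 3))
```

The result reproduces the reported chi-squared of 7.272 with p of about 0.026.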

Finally, we can fit a random intercept and slope model that allows for a group × time interaction.

Compare the model with an interaction

Model                          Df        AIC        BIC      logLik   deviance  Chisq  Chi Df  Pr(>Chisq)
Random slopes and intercepts    7  2,745.440  2,772.643  -1,365.720  2,731.440
With group × time interaction   8  2,744.269  2,775.358  -1,364.135  2,728.269  3.171       1       0.075

The model with an interaction between time and treatment group fits the data slightly better, but the improvement is not significant at the 5% level (p = 0.075). We can plot the fitted values.

The plot shows clearly that the score falls with the duration of the treatment in weeks, and this decrease is statistically significant. There are no significant differences between the treatment groups.